
Extend power analysis to all estimators + simulation-based MDE/sample size #208

Open
igerber wants to merge 11 commits into main from power-analysis

Conversation


@igerber igerber commented Mar 18, 2026

Summary

  • Add estimator registry mapping all 12 built-in estimators (excluding BaconDecomposition) to their appropriate DGP, fit kwargs, and result extraction profiles
  • Refactor simulate_power() from hardcoded 4-estimator if/elif chain to registry-based lookup supporting all estimators automatically
  • Add simulate_mde() — bisection search over effect sizes to find the minimum detectable effect for any estimator
  • Add simulate_sample_size() — bisection search over n_units to find the required sample size for any estimator
  • Add SimulationMDEResults and SimulationSampleSizeResults dataclasses with summary(), to_dict(), to_dataframe() methods
  • 34 new tests: registry unit tests, per-estimator coverage tests, MDE search tests, sample size search tests
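
Both search helpers described above are bisection searches over a monotone power curve. A minimal sketch of the simulate_mde() idea, using a hypothetical power_fn in place of the real per-evaluation Monte Carlo simulation:

```python
def bisect_mde(power_fn, target=0.8, lo=0.01, hi=2.0, tol=0.01):
    """Smallest effect size whose estimated power reaches `target`.

    Assumes power_fn(effect) is monotonically increasing in the effect
    size; in the real helper each call runs many simulated datasets.
    """
    if power_fn(hi) < target:
        raise ValueError("upper bound of the search range never reaches target power")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if power_fn(mid) >= target:
            hi = mid  # powered at mid: the MDE is at or below mid
        else:
            lo = mid  # underpowered at mid: the MDE is above mid
    return hi

# Toy linear power curve for illustration only; its true MDE is 0.8.
mde = bisect_mde(lambda effect: min(1.0, effect))
```

simulate_sample_size() follows the same pattern with an integer search over n_units.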

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no methodology changes, this extends power analysis infrastructure only
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: tests/test_power.py — 34 new tests across 4 test classes (TestEstimatorRegistry, TestEstimatorCoverage, TestSimulateMDE, TestSimulateSampleSize)
  • All 76 tests pass (4 skipped: matplotlib, 2 deselected: slow-marked TROP/SDiD)
  • black, ruff check, and mypy all pass clean
  • Backtest / simulation / notebook evidence: N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⛔ Blocker

Highest unmitigated severity is P0.

Executive Summary

  • The new ContinuousDiD registry profile is methodologically wrong for power analysis: it varies the DGP slope, but simulate_power() treats that scalar as the estimator’s scalar truth, even though the estimator returns overall_att. With defaults, treatment_effect=0 is still a non-null design.
  • The new simulate_mde() / simulate_sample_size() explicit-range paths can return unchecked upper bounds and power_at_* = 0.0 without ever evaluating the endpoint, because the supplied bracket is neither validated nor fully evaluated.
  • The added tests are mostly smoke tests. They do not cover ContinuousDiD null calibration or the explicit effect_range / n_range branches, so both defects can merge undetected.
  • I did not find a mitigating Methodology Registry note or TODO.md entry for either issue.

Path to Approval

  1. Remove ContinuousDiD from the automatic power registry for this PR, or redesign that profile so the simulated DGP, extracted estimate, and reported true_effect all refer to the same documented scalar estimand.
  2. In simulate_mde() and simulate_sample_size(), always evaluate user-supplied bracket endpoints, validate that the target power is actually bracketed, and never default power_at_* to 0.0 for unevaluated points.
  3. Add regression tests for ContinuousDiD under a null effect and for explicit effect_range / n_range edge cases, including narrow and unbracketed ranges.

Methodology

  • Severity: P0. Impact: The new ContinuousDiD power profile does not simulate the same estimand it evaluates. _continuous_dgp_kwargs() maps treatment_effect into the DGP’s att_slope only, while the generator keeps att_intercept=1.0 by default and defines linear treatment effects as att_intercept + att_slope * d. But _extract_continuous() pulls result.overall_att, and simulate_power() stores the input scalar as true_effect for bias/coverage. The Methodology Registry defines ContinuousDiD targets as ATT(d), ATT^{glob}, and ACRT(d), not the slope parameter. So this profile is wrong in two ways: treatment_effect=0 is still non-null by default, and even with att_intercept=0, the reported truth is still not the returned estimand. Concrete fix: either drop ContinuousDiD from _get_registry() for now, or add a ContinuousDiD-specific power API that targets a documented scalar quantity and computes the truth from the generated dose-response, not from the input slope. Evidence: diff_diff/power.py:127, diff_diff/power.py:273, diff_diff/power.py:1538, diff_diff/prep_dgp.py:780, diff_diff/prep_dgp.py:892, docs/methodology/REGISTRY.md:459
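
The "compute the truth from the generated dose-response" fix can be sketched as follows. The linear-effect form att_intercept + att_slope * d follows the review's description; the function name and signature are illustrative, not the library's actual API:

```python
def true_overall_att(doses, att_intercept, att_slope):
    """Scalar truth matching an overall_att-style estimand.

    Averages the linear effect att_intercept + att_slope * d over the
    treated doses, instead of reporting the raw slope parameter.
    """
    treated = [d for d in doses if d > 0]
    return sum(att_intercept + att_slope * d for d in treated) / len(treated)

# With the reported default att_intercept=1.0, a zero slope is still a
# non-null design: the dose-weighted truth is 1.0, not 0.0.
truth = true_overall_att([0.0, 0.5, 1.0], att_intercept=1.0, att_slope=0.0)
```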

Code Quality

  • Severity: P1. Impact: The new bisection helpers do not correctly handle explicit search ranges. When effect_range / n_range is provided, the code just assigns lo, hi and skips endpoint evaluation. best_power is then initialized from an empty search_path to 0.0, and simulate_sample_size() can even exit before simulating anything when hi - lo <= 2. That means these functions can return an unchecked upper bound labeled as mde / required_n, together with fabricated power_at_mde / power_at_n = 0.0. Concrete fix: evaluate both endpoints whenever a bracket is supplied, require or explicitly warn/error if the target is not bracketed, and only report power_at_* for values that were actually simulated. Evidence: diff_diff/power.py:1842, diff_diff/power.py:1879, diff_diff/power.py:2016, diff_diff/power.py:2033, diff_diff/power.py:2036
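
The endpoint-evaluation fix can be sketched as a guard that runs before bisection; validate_bracket and power_fn are hypothetical names for illustration:

```python
def validate_bracket(power_fn, lo, hi, target):
    """Evaluate both bracket endpoints before bisecting.

    Raises when the target power is not actually bracketed, so the
    search can never return an unevaluated upper bound or report a
    fabricated power of 0.0 for a point it never simulated.
    """
    power_lo, power_hi = power_fn(lo), power_fn(hi)
    if power_lo >= target:
        raise ValueError(f"lower bound already reaches target power ({power_lo:.2f})")
    if power_hi < target:
        raise ValueError(f"upper bound never reaches target power ({power_hi:.2f})")
    return power_lo, power_hi

# Toy power curve: the bracket [0.1, 1.5] straddles target power 0.8.
power_lo, power_hi = validate_bracket(lambda e: min(1.0, e), lo=0.1, hi=1.5, target=0.8)
```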

Performance

No material findings.

Maintainability

No material findings beyond the correctness issues above.

Tech Debt

No separate tech-debt-only findings. The issues above are not tracked in TODO.md:38, so they remain unmitigated for assessment purposes.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The added tests do not exercise the new failure modes. There is an existing DID null-effect calibration test, but the new ContinuousDiD coverage only checks that a result object is finite, and all new MDE/sample-size tests are DID-only. There is also no test for explicit effect_range / n_range inputs. Concrete fix: add tests that assert ContinuousDiD null behavior is calibrated or explicitly unsupported, plus tests that explicit brackets are evaluated/validated and that unbracketed ranges fail loudly. Evidence: tests/test_power.py:390, tests/test_power.py:980, tests/test_power.py:1001, tests/test_power.py:1106

Review note: I could not execute the test suite in this environment because the local Python environment is missing numpy, so this is a static diff review.

igerber added a commit that referenced this pull request Mar 18, 2026
…bisection brackets

P0: Remove ContinuousDiD from power analysis registry — the DGP simulated
att_slope as treatment_effect but extracted overall_att (a dose-weighted
average), so the simulated truth and evaluated estimand didn't match.

P1: Add endpoint evaluation for user-supplied effect_range/n_range in
simulate_mde() and simulate_sample_size(). Previously these paths skipped
evaluation entirely, leaving search_path empty and best_power at 0.0.
Now warns when brackets don't contain the target power.

P2: Add regression tests for all three fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 11539e7a01a24a990ff517c75c04099557578681


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • The prior ContinuousDiD blocker is resolved: it is no longer auto-registered for simulation power, and there is now a regression test for that path.
  • The new automatic TwoWayFixedEffects registry entry is still methodologically wrong for the default multi-period simulation path. It passes time="period" into an estimator that explicitly warns this does not estimate the standard DiD ATT.
  • The explicit-endpoint fix is only partial. simulate_sample_size() still never uses the lower-endpoint power, so it can return an oversized required_n when the lower bound already achieves target power.
  • The new tests are mostly smoke tests. They do not lock down either the TWFE methodology issue or the lower-bound sample-size bug.

Methodology

  • Severity: P1. Impact: the new TWFE registry profile is not aligned with the estimator’s documented estimand. _twfe_fit_kwargs() wires automatic power to time="period", but TwoWayFixedEffects.fit() explicitly warns that multi-valued time inputs estimate treated * period_number, not the standard treated * post ATT. That means the new “all built-in estimators are supported automatically” claim is false for TWFE under the default n_periods=4, and the resulting simulate_power(), simulate_mde(), and simulate_sample_size() outputs for TWFE are not methodologically valid. Concrete fix: do not auto-register TWFE for multi-period power simulation in its current form. Either restrict automatic TWFE support to validated two-period / binary-post designs, or change the estimator/power interface so time fixed effects and the post indicator are passed separately before claiming automatic support. Evidence: diff_diff/power.py:L139, diff_diff/power.py:L267, diff_diff/power.py:L1224, diff_diff/twfe.py:L96, docs/methodology/REGISTRY.md:L238, docs/methodology/REGISTRY.md:L253
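
The binary-post restriction the review asks for can be sketched in plain Python; post_indicator and treat_period are hypothetical illustration names:

```python
def post_indicator(periods, treat_period):
    """Collapse a multi-valued period column into a 0/1 post indicator.

    Fitting with time="post" makes the TWFE interaction treated * post
    (the standard DiD ATT) rather than treated * period_number.
    `treat_period` is the first post-treatment period (an assumption here).
    """
    return [int(p >= treat_period) for p in periods]

# Four periods with treatment starting at period 2.
post = post_indicator([0, 1, 2, 3], treat_period=2)
```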

Code Quality

  • Severity: P1. Impact: simulate_sample_size() still violates the basic bisection invariant because it evaluates the lower endpoint only for logging and never checks whether that lower bound already meets target power. If lo already satisfies the target, the function still bisects above it and can silently return a too-large required_n; the same bug exists in the auto-bracketing path when min_n is already sufficient. Concrete fix: evaluate and store power_lo, return lo immediately when power_lo >= power, and only enter bisection once the invariant power(lo) < target <= power(hi) holds. Evidence: diff_diff/power.py:L1988, diff_diff/power.py:L1999, diff_diff/power.py:L2016
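
The invariant the review describes can be sketched as an integer bisection that evaluates both endpoints up front; all names are hypothetical:

```python
def bisect_sample_size(power_fn, lo, hi, target=0.8):
    """Smallest n with power_fn(n) >= target, assuming monotone power.

    Both endpoints are evaluated first: the search returns lo when it
    already meets the target, and errors when hi falls short, so the
    bisection only runs under power(lo) < target <= power(hi).
    """
    if power_fn(lo) >= target:
        return lo  # short-circuit: lower bound is already sufficient
    if power_fn(hi) < target:
        raise ValueError("upper bound never reaches target power")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if power_fn(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi

# Toy monotone curve reaching 80% power at n = 400.
required_n = bisect_sample_size(lambda n: min(1.0, n / 500), lo=50, hi=2000)
# A strong effect where lo already suffices short-circuits to 50.
n_short = bisect_sample_size(lambda n: 0.95, lo=50, hi=2000)
```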

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the correctness issues above.

Tech Debt

No separate deferred-work findings. These issues are not tracked in the code-review debt section of TODO.md:L38, so they remain unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests would not catch either P1. test_twfe() only asserts that the call returns a finite-looking result, not that the TWFE registry path is methodologically valid, and the sample-size tests never cover the case where the lower search bound already achieves target power. The simulate_power() docstring also overstates support by saying all built-in estimators are automatically supported. Concrete fix: add regression tests that either reject unsupported multi-period TWFE power simulation or validate a corrected two-period TWFE path, plus tests where simulate_sample_size() must return the lower endpoint for both explicit n_range and auto-bracketed strong-effect cases; narrow the docstring claim until that coverage exists. Evidence: tests/test_power.py:L844, tests/test_power.py:L1177, tests/test_power.py:L1201, diff_diff/power.py:L1224

Path to Approval

  1. Restrict or redesign automatic TWFE power support so it does not run the current multi-period time="period" path and report it as standard ATT power.
  2. Fix simulate_sample_size() to use both bracket endpoints and short-circuit when the lower bound already meets target power.
  3. Add regression tests for the two cases above, including an explicit lower-bound-sufficient sample-size search.

Static diff review only. I could not execute the test suite in this environment because the local Python environment is missing numpy.

igerber added a commit that referenced this pull request Mar 18, 2026
…short-circuit

- Remove TwoWayFixedEffects from power analysis registry (time="period"
  produces treated*period_number, not standard ATT)
- Add early return in simulate_sample_size() when lower bound already
  achieves target power (both explicit n_range and auto-bracket paths)
- Narrow docstring from "All" to "Most" built-in estimators
- Add regression tests for TWFE exclusion and lo-sufficient scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d79b042bd80e8d8a1afa453ae8397f4fb2e95b37


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

The two prior blockers from the last review appear resolved: TwoWayFixedEffects is no longer auto-registered for simulation power, and simulate_sample_size() now short-circuits when the lower bound already meets target power, with regression coverage for both cases diff_diff/power.py:L249-L331, tests/test_power.py:L803-L813, diff_diff/power.py:L1972-L2003, tests/test_power.py:L1229-L1256. One new P1 remains.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology issue above.

Performance

No material findings in the changed code.

Maintainability

No material findings.

Tech Debt

No separate deferred-work findings. The remaining P1 is not tracked under the code-review debt section of TODO.md:L38-L64, so it remains unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new test coverage does not exercise the public default SyntheticDiD auto-support path. test_synthetic_did() only uses treatment_fraction=0.3, and the new MDE/sample-size tests only cover DifferenceInDifferences, so the current default-path regression would not be caught by the suite. Concrete fix: add regression tests for simulate_power(SyntheticDiD(), progress=False) at defaults, plus corresponding simulate_mde() / simulate_sample_size() coverage, and assert either valid finite inference or an explicit fail-fast contract. References: tests/test_power.py:L959-L970, tests/test_power.py:L978-L1103, tests/test_power.py:L1111-L1256.

Path to Approval

  1. Make the automatic SyntheticDiD simulation path valid or explicitly unsupported. Concretely: detect the default placebo-variance case where n_control <= n_treated and either raise with a clear message, or switch the supported automatic configuration to one that yields valid inference.
  2. Add regression tests covering the public default SyntheticDiD path in simulate_power(), and the same configuration through simulate_mde() and simulate_sample_size().

Static review only. I could not run the test suite in this environment because pytest is unavailable, and local Python imports fail because numpy is not installed.

igerber added a commit that referenced this pull request Mar 18, 2026
…ro-power bug

Add _sdid_dgp_kwargs that caps treatment_fraction at 0.4 so the placebo
variance method has enough pseudo-controls (n_control > n_treated).
Add regression test for the default SyntheticDiD path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 32f2b2b52bb09b1d35bcbe6a9c5b5f8dab5e3ee8


Overall Assessment

⛔ Blocker

Highest unmitigated severity: P0.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology blocker above.

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the test gap below.

Tech Debt

No relevant entry under TODO.md:L38-L64 mitigates the blocker above, so it remains unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests validate that the default SyntheticDiD path no longer returns zero power, but they do not verify that the requested design was preserved, and they do not exercise SyntheticDiD through simulate_mde() or simulate_sample_size(). That is why the silent treatment_fraction rewrite passes the suite. Concrete fix: add public-API regression tests that assert either (a) the realized treated share matches the requested treatment_fraction, or (b) unsupported placebo designs raise/warn explicitly; run the same contract through simulate_power(), simulate_mde(), and simulate_sample_size() for SyntheticDiD. References: tests/test_power.py:L697-L729, tests/test_power.py:L975-L987, tests/test_power.py:L995-L1269.

Path to Approval

  1. Remove the unconditional 40% cap from _sdid_dgp_kwargs() so the simulation respects the caller’s treatment_fraction.
  2. Add an explicit contract for unsupported SyntheticDiD placebo designs: if n_control <= n_treated, raise with a clear message or require an explicit bootstrap configuration, rather than silently changing the design.
  3. Add regression coverage for SyntheticDiD across simulate_power(), simulate_mde(), and simulate_sample_size() that checks the contract above.
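
The fail-fast contract in item 2 can be sketched as an upfront check; the function name and message wording are illustrative, not the library's actual API:

```python
def check_placebo_design(n_units, treatment_fraction):
    """Validate a placebo-variance design before simulating.

    Placebo variance needs strictly more controls than treated units;
    rather than silently capping treatment_fraction, fail with
    actionable guidance.
    """
    n_treated = round(n_units * treatment_fraction)
    n_control = n_units - n_treated
    if n_control <= n_treated:
        raise ValueError(
            f"Placebo variance requires n_control > n_treated "
            f"(got {n_control} <= {n_treated}). Lower treatment_fraction "
            "or use bootstrap variance."
        )
    return n_treated, n_control

counts = check_placebo_design(100, 0.3)  # valid: 30 treated, 70 controls
```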

Static review only. I could not run pytest here because pytest is unavailable, and local imports fail because numpy is not installed.

igerber added a commit that referenced this pull request Mar 18, 2026
…n cap with fail-fast validation

Remove _sdid_dgp_kwargs that silently capped treatment_fraction at 0.4.
Add upfront ValueError in simulate_power() when SyntheticDiD placebo
variance has n_control <= n_treated, with actionable error message
suggesting lowering treatment_fraction or using bootstrap variance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4fe4a7ab2714d25bd64c6742359b04631c77e2e1


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • The prior SyntheticDiD blocker is resolved: the code now fails fast for placebo designs with n_control <= n_treated, the behavior is documented in the Methodology Registry, and tests cover simulate_power(), simulate_mde(), and simulate_sample_size().
  • [Newly identified] The registry refactor leaves TwoWayFixedEffects outside the automatic power-analysis path, so the default simulation APIs now error for a first-class public estimator. The new tests explicitly lock that unsupported state in.
  • The new "provide a custom data_generator and estimator_kwargs" fallback is not actually generic. Unregistered estimators with non-basic fit() signatures, such as ContinuousDiD, still receive hard-coded basic-DiD kwargs and cannot use the advertised escape hatch.
  • Test coverage is broad for registered estimators and the SyntheticDiD fail-fast path, but it adds no positive regression coverage for TwoWayFixedEffects and no test of the advertised custom fallback for non-basic estimators.

Methodology

Code Quality

  • Severity: P1. Impact: TwoWayFixedEffects is still exported as a core public estimator, but the new registry omits it and simulate_power() now raises "not in registry"; simulate_mde() and simulate_sample_size() inherit the same failure. The suite then codifies that unsupported state. This is an undocumented contraction of the default power-analysis API introduced by the refactor. Concrete fix: add a dedicated TwoWayFixedEffects profile to _get_registry() and cover all three public simulation entry points. When you add it, use a binary post indicator in the fit kwargs (time="post", unit="unit"), not the old time="period" pattern, because the registry explicitly says multi-valued time does not estimate the standard TWFE ATT. References: diff_diff/__init__.py:L37-L42, diff_diff/__init__.py:L188-L200, diff_diff/power.py:L249-L331, diff_diff/power.py:L1298-L1307, docs/methodology/REGISTRY.md:L253-L257, tests/test_power.py:L803-L814

Performance

No material findings in the changed code.

Maintainability

  • Severity: P2. Impact: The new fallback contract for unregistered estimators is incomplete. The error path tells users to "Provide a custom data_generator and estimator_kwargs," but when profile is None the code still injects treatment="treated" and time="post" unconditionally. Estimators with non-basic fit() signatures, such as ContinuousDiD.fit(data, outcome, unit, time, first_treat, dose, ...), cannot actually use the advertised escape hatch. Concrete fix: either add a truly generic fit_kwargs/fit_kwargs_builder override that lets callers fully control the fit signature, or raise a targeted error for non-basic unregistered estimators instead of suggesting a path that still fails. References: diff_diff/power.py:L1302-L1307, diff_diff/power.py:L1408-L1418, diff_diff/continuous_did.py:L153-L161

Tech Debt

No relevant TODO.md entry mitigates the P1 or P2 findings above. Reference: TODO.md:L38-L64

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests institutionalize TwoWayFixedEffects as unsupported instead of guarding against the regression, and they do not exercise the advertised custom-generator fallback for an unregistered non-basic estimator. That leaves both API-contract issues above unguarded. Concrete fix: replace the negative TwoWayFixedEffects test with positive coverage for simulate_power(), simulate_mde(), and simulate_sample_size(), and add one regression test that either proves the custom fallback works for a non-basic estimator or asserts a targeted "unsupported fit signature" error. References: tests/test_power.py:L803-L814, tests/test_power.py:L834-L1067, tests/test_power.py:L1075-L1352

Path to Approval

  1. Add a TwoWayFixedEffects registry entry so the default simulation APIs cover that public estimator.
  2. Make the unregistered-estimator fallback honest: either support fully caller-specified fit kwargs or fail with a targeted error instead of hard-coding basic-DiD kwargs.
  3. Add regression coverage for TwoWayFixedEffects across simulate_power(), simulate_mde(), and simulate_sample_size(), plus one test covering the intended behavior for an unregistered non-basic estimator.

Static review only. I did not execute the test suite in this environment.

igerber added a commit that referenced this pull request Mar 18, 2026
…tor fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d2fdaf892bbe20fd99ce97b96ba339bc9121ff3


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Static review only. I could not execute the test suite in this environment because pytest and the scientific Python deps are not installed.

Executive Summary

  • The prior SyntheticDiD blocker is resolved: the simulation APIs now fail fast when placebo variance would violate n_control > n_treated, that deviation is documented in the Methodology Registry, and tests cover simulate_power(), simulate_mde(), and simulate_sample_size().
  • The prior TwoWayFixedEffects blocker is resolved: TWFE is back in the estimator registry, and the registry now uses a binary post indicator, which matches the Registry guidance for standard TWFE ATT.
  • The prior “custom fallback” concern is only partially resolved: caller-controlled estimator_kwargs fixed the old fit-signature problem, but unregistered estimators with nonstandard result objects still cannot use the advertised escape hatch.
  • The power docs remain stale and unguarded: docs/api/power.rst still documents nonexistent PowerAnalysis.compute_* methods and old simulate_power() arguments, and the new snippet smoke test never exercises that file.
  • The new TODO.md entry correctly tracks the remaining blanket-NameError limitation in doc-snippet smoke tests, so that item is informational only.

Methodology

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • Severity: P2. Impact: The advertised unregistered-estimator fallback is still not truly generic. When profile is None, simulate_power() now lets callers fully control estimator.fit() kwargs, but result extraction still assumes either .att/.se/.p_value/.conf_int or .avg_*. Public unregistered estimators with different result schemas, such as ContinuousDiD (overall_att, overall_att_se, overall_att_p_value, overall_att_conf_int), will still fail even if the caller supplies a correct custom DGP wrapper and fit kwargs. That means the previous “generic fallback” concern is only partially resolved. Concrete fix: add a caller-supplied result_extractor hook, or extend the fallback extractor to handle overall_* result objects; then add a regression test using ContinuousDiD with explicit data_generator and estimator_kwargs. References: diff_diff/power.py:L1318-L1446, diff_diff/continuous_did.py:L153-L161, diff_diff/continuous_did_results.py:L112-L118, tests/test_power.py:L1097-L1138
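
The result_extractor hook the review proposes can be sketched as a simple strategy override. The attribute names mirror the review's description of the standard and overall_* schemas; everything else is a hypothetical illustration:

```python
from types import SimpleNamespace

def default_extractor(result):
    """Pull (estimate, se, p_value, conf_int) from a standard result."""
    return result.att, result.se, result.p_value, result.conf_int

def extract(result, result_extractor=None):
    """Use a caller-supplied extractor for nonstandard result schemas."""
    return (result_extractor or default_extractor)(result)

# A ContinuousDiD-style result exposes overall_* fields instead.
continuous = SimpleNamespace(
    overall_att=1.2, overall_att_se=0.3,
    overall_att_p_value=0.01, overall_att_conf_int=(0.6, 1.8),
)
est, se, p, ci = extract(
    continuous,
    result_extractor=lambda r: (
        r.overall_att, r.overall_att_se,
        r.overall_att_p_value, r.overall_att_conf_int,
    ),
)
```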

Tech Debt

  • Severity: P3. Impact: tests/test_doc_snippets.py still treats any NameError as a pass, which can mask standalone snippet breakage, but this PR explicitly tracks that limitation in TODO.md, so it is mitigated and non-blocking under the review policy. Concrete fix: none required for approval; future work is to replace the blanket NameError catch with a per-snippet allowlist or explicit fixture setup. References: tests/test_doc_snippets.py:L357-L363, TODO.md:L62-L65

Security

  • No material findings.

Documentation/Tests

  • Severity: P2. Impact: The power documentation is still out of sync with the implementation, and the new smoke-test coverage does not protect it. docs/api/power.rst still advertises nonexistent PowerAnalysis.compute_* methods and stale simulate_power() parameters, while the newly exported simulate_mde, simulate_sample_size, SimulationMDEResults, and SimulationSampleSizeResults are absent from the API index. Because tests/test_doc_snippets.py does not include docs/api/power.rst, this broken documentation path remains untested. Concrete fix: update docs/api/power.rst to the current API (power, mde, sample_size, simulate_mde, simulate_sample_size, and both new result classes), add the new symbols to docs/api/index.rst and autosummary, and include api/power.rst in RST_FILES. References: docs/api/power.rst:L33-L65, docs/api/power.rst:L145-L160, docs/api/index.rst:L148-L155, tests/test_doc_snippets.py:L25-L40, diff_diff/__init__.py:L53-L64, diff_diff/__init__.py:L295-L306


igerber commented Mar 18, 2026

/ai-review

igerber and others added 7 commits March 18, 2026 15:00
…sample size

- Add estimator registry mapping all 12 estimators to appropriate DGP,
  fit kwargs, and result extraction profiles
- Refactor simulate_power() to use registry lookup instead of hardcoded
  if/elif chain (was 4 estimators, now 12 + custom fallback)
- Add simulate_mde() for bisection search over effect sizes
- Add simulate_sample_size() for bisection search over n_units
- Add SimulationMDEResults and SimulationSampleSizeResults dataclasses
- 34 new tests: registry, estimator coverage, MDE search, sample size search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… API symbols

- Add `result_extractor` parameter to simulate_power, simulate_mde, and
  simulate_sample_size for unregistered estimators with non-standard schemas
- Fix power.rst: correct PowerAnalysis method names, example code, and add
  SimulationMDEResults/SimulationSampleSizeResults/simulate_mde/simulate_sample_size
- Add 4 missing symbols to docs/api/index.rst autosummary
- Add api/power.rst to doc snippet smoke tests
- Add tests for custom result_extractor and MDE forwarding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a15ce5a4d7a4a318b9edb024152b61cf8307004c


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review against the provided diff and surrounding source context; I did not execute the test suite in this environment.

Executive Summary

  • Prior blockers from the last review appear addressed: TwoWayFixedEffects is back in the power registry with a binary post path, and the SyntheticDiD placebo fail-fast behavior is now documented in the Methodology Registry.
  • One unmitigated P1 remains in the new automatic power-analysis path: the staggered-estimator registry hardcodes a single-cohort + never-treated DGP and does not adapt to estimator-level settings like control_group, clean_control, or anticipation.
  • That means simulate_power(), and by extension simulate_mde() / simulate_sample_size(), can silently report power for a different design than the configured estimator object the caller passed.
  • The new power tests mostly cover default estimator instances, so those parameter-dependent staggered paths are not exercised.
  • The doc-snippet NameError masking limitation is now tracked in TODO.md, so it is informational only.

Methodology

Code Quality

  • No material findings in the changed code beyond the methodology issue above.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • Severity: P3. Impact: the new doc-snippet smoke test still treats any NameError as a pass, which can mask standalone example breakage, but this PR also adds that limitation to TODO.md, so it is properly tracked and non-blocking under the review policy. Concrete fix: none required for approval; later, replace blanket NameError suppression with per-snippet setup or an allowlist. References: tests/test_doc_snippets.py:L357-L363, TODO.md:L60-L65

Security

  • No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new estimator-coverage tests only instantiate default estimator objects, so the regression suite never exercises the new auto-DGP path for control_group="not_yet_treated", clean_control="strict"/"not_yet_treated", or anticipation>0. That leaves the methodology mismatch above unguarded. Concrete fix: add explicit simulation tests for those configurations across simulate_power(), simulate_mde(), and simulate_sample_size(). References: tests/test_power.py:L828-L1095

Path to Approval

  1. Update the staggered power registry so the default DGP matches estimator settings that change identification/comparison design (control_group, clean_control, cohort structure, anticipation), or refuse those configurations unless callers supply matching data_generator_kwargs / data_generator.
  2. Add regression tests for at least CallawaySantAnna(control_group="not_yet_treated") with data_generator_kwargs={"never_treated_frac": 0.0, "cohort_periods": [2, 4]}, StackedDiD(clean_control="strict", kappa_pre=1, kappa_post=1) with later-treated clean controls, and one anticipation>0 staggered estimator case where the DGP matches the shifted effect onset/reference period.
  3. If any automatic-simulation limitation remains intentional, document it in docs/methodology/REGISTRY.md with a Note or Deviation label and emit a runtime warning when the registry default DGP is not faithful to the configured estimator.

Add _check_staggered_dgp_compat() that detects control_group="not_yet_treated",
clean_control="strict", and anticipation>0 on staggered estimators using the
automatic single-cohort DGP, and emits UserWarning with specific guidance on
supplying data_generator_kwargs or a custom data_generator.

- 5 warning/no-warning unit tests + 2 slow regression tests with matching DGP
- Document limitation in REGISTRY.md PowerAnalysis section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
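The compat check described in this commit could look roughly like the following sketch. The helper name (the leading underscore is dropped here), the estimator attribute names, and the warning texts are assumptions based on the review discussion, not the actual diff_diff implementation.

```python
import warnings


def check_staggered_dgp_compat(estimator, data_generator_kwargs=None,
                               data_generator=None):
    """Warn when a staggered estimator's settings diverge from the automatic DGP.

    Hypothetical sketch: attribute names (control_group, clean_control,
    anticipation) follow the review text; the real helper is
    _check_staggered_dgp_compat() in diff_diff/power.py.
    """
    if data_generator is not None:
        return  # custom DGP supplied; the caller owns design fidelity
    overrides = data_generator_kwargs or {}
    if (getattr(estimator, "control_group", "never_treated") == "not_yet_treated"
            and "never_treated_frac" not in overrides):
        warnings.warn(
            "control_group='not_yet_treated' but the automatic DGP keeps a "
            "never-treated pool; pass data_generator_kwargs="
            "{'never_treated_frac': 0.0, 'cohort_periods': [...]} or a custom "
            "data_generator.",
            UserWarning,
        )
    if getattr(estimator, "clean_control", "all") == "strict" and not overrides:
        warnings.warn(
            "clean_control='strict' has no later-treated clean controls under "
            "the single-cohort automatic DGP; supply cohort_periods via "
            "data_generator_kwargs.",
            UserWarning,
        )
    if getattr(estimator, "anticipation", 0) > 0:
        warnings.warn(
            "anticipation>0 is not reflected in the automatic DGP's effect "
            "onset; supply a data_generator with a shifted onset.",
            UserWarning,
        )
```

A default-configured estimator (never-treated controls, no anticipation) passes through silently, matching the "no-warning" unit tests the commit mentions.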
@igerber
Copy link
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 800d37f224f76b2218d164098a3fdec8d3797e6d


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review of the provided diff and surrounding source; I did not execute the test suite in this environment because the package import fails without numpy.

Executive Summary

  • The prior staggered-registry blocker is mitigated: the single-cohort auto-DGP limitation is now documented in the Methodology Registry, surfaced as UserWarnings, and covered by regression tests.
  • One P1 remains in the new SyntheticDiD power path: the placebo-variance fail-fast ignores data_generator_kwargs overrides that change the effective treated/control split.
  • That allows simulate_power(), simulate_mde(), and simulate_sample_size() to return numeric output for an infeasible placebo design instead of raising the documented ValueError.
  • The new SDID tests only cover the treatment_fraction boundary, so the override-based regression is not exercised.
  • I did not find additional material issues in code quality, performance, maintainability, security, or docs beyond that gap.

Methodology

Code Quality

No material findings in the changed code beyond the methodology issue above.

Performance

No material findings.

Maintainability

No material findings.

Tech Debt

No material findings newly introduced by this diff.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new SDID regression coverage only checks the placebo boundary through treatment_fraction; it never exercises the supported data_generator_kwargs override path that changes the actual treated count, so the P1 above shipped untested. Concrete fix: add simulate_power(), simulate_mde(), and simulate_sample_size() tests using SyntheticDiD(), n_units=50, treatment_fraction=0.3, and data_generator_kwargs={"n_treated": 30} (or any override that makes n_control <= n_treated) and assert the same ValueError as the high-treatment_fraction path. References: tests/test_power.py:L968-L1030, diff_diff/power.py:L1409-L1500.

Path to Approval

  1. Move the SyntheticDiD placebo-feasibility validation to the effective DGP parameters after data_generator_kwargs are merged, so registry-based overrides of n_treated cannot bypass the documented ValueError.
  2. Add regression tests for all three simulation helpers covering the override case, not just the treatment_fraction case.
  3. If the fail-fast guarantee is intentionally limited to the default factor DGP, narrow the REGISTRY.md note/error text accordingly; otherwise keep the current note and make the validation match it.

… override

Move the SyntheticDiD placebo-variance fail-fast to after the kwargs merge so
that data_generator_kwargs={"n_treated": N} overrides are caught. Add tests for
the override path across simulate_power, simulate_mde, and simulate_sample_size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
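The merge-then-validate ordering in this commit can be sketched as follows. The function name, the `n_treated` key, and the default treated-count formula are illustrative assumptions, not the actual diff_diff code.

```python
def effective_sdid_design(n_units, treatment_fraction, data_generator_kwargs=None):
    """Validate SDiD placebo feasibility on the *merged* design.

    Sketch of the ordering fix: merge registry defaults with caller overrides
    first, then check n_control > n_treated, so an override of n_treated
    cannot bypass the documented ValueError.
    """
    defaults = {"n_treated": max(1, round(n_units * treatment_fraction))}
    merged = {**defaults, **(data_generator_kwargs or {})}
    n_treated = merged["n_treated"]
    n_control = n_units - n_treated
    if n_control <= n_treated:
        raise ValueError(
            f"SyntheticDiD placebo variance requires n_control > n_treated "
            f"(got {n_control} control vs {n_treated} treated)"
        )
    return merged
```

With the review's example (n_units=50, treatment_fraction=0.3, override n_treated=30), the defaults alone pass but the merged design fails, which is exactly the case the pre-fix code let through.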
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 9e18a3ea05f874da0a9bc63f1f66fc0eb5f2f343


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review only; I could not execute the test suite in this environment because pytest and numpy are unavailable.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology issue above.

Performance

No material findings.

Maintainability

No material findings beyond the need to make the TripleDifference adapter semantics explicit.

Tech Debt

No separate tech-debt-only findings. The TripleDifference adapter issue above is not tracked in TODO.md, so it remains unmitigated for approval purposes. See TODO.md:L38.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new test coverage does not exercise the DDD adapter’s parameter translation edge cases. TripleDifference is only tested in simulate_power() with n_units=80, which happens to map cleanly to 10 observations per cell, and there are no DDD-specific simulate_mde() or simulate_sample_size() cases. That leaves the P1 above unguarded. Concrete fix: add TripleDifference regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() with values like n_units=10 and n_units=65, and assert the intended behavior for unsupported n_periods/treatment_period/treatment_fraction and for effective sample-size handling. See tests/test_power.py:L930, tests/test_power.py:L1353, tests/test_power.py:L1486.

Path to Approval

  1. Make the TripleDifference registry path enforce a one-to-one mapping between public simulation inputs and the actual DDD DGP: validate or reject unsupported n_periods, treatment_period, and treatment_fraction, and stop silently rounding n_units without surfacing the effective design.
  2. Add DDD-specific regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() covering non-multiples of 8 and unsupported generic design knobs.
  3. If any DDD-specific rounding or fixed-2x2x2 behavior is intentional, document it in docs/methodology/REGISTRY.md with a **Note:** and mirror that behavior in the public docstrings/results so the simulated design is explicit to callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
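A strict version of the DDD unit mapping requested in point 1 would reject non-multiples of the 8-cell design instead of silently rounding. This is a hypothetical sketch (`ddd_effective_n` is not a real diff_diff function); the 2×2×2 cell count and the n_units=80 → 10-per-cell mapping follow the review text.

```python
def ddd_effective_n(n_units, cells=8):
    """Map a requested n_units onto the fixed 2x2x2 DDD design.

    Sketch: instead of silently rounding n_units, reject values that do not
    divide evenly across the 8 cells, surfacing the effective design.
    """
    if n_units % cells != 0:
        lower = (n_units // cells) * cells
        raise ValueError(
            f"n_units={n_units} is not divisible by the {cells} cells of the "
            f"2x2x2 DDD design; nearest valid values are {lower} and "
            f"{lower + cells}."
        )
    return n_units // cells  # observations per cell
```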
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0e912bbed172b3958c4e87088e35665c9a0b17c5


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only; I could not execute pytest in this environment because both pytest and numpy are unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: simulate_sample_size() and SimulationSampleSizeResults.required_n are documented as returning the required/smallest sample size, but the auto path seeds lo from an internal registry floor and returns immediately if that floor already reaches target power. Because that branch never evaluates any n < min_n, it can silently report a non-minimal required_n. This is a wrong statistical output, not just a conservative implementation choice. Concrete fix: use min_n only as an initial guess for bracketing, then search downward to the estimator’s true feasibility floor before returning; if you intend to return an upper bound instead, rename/document the result accordingly and emit a warning. See diff_diff/power.py:L1774-L1777, diff_diff/power.py:L2072-L2075, diff_diff/power.py:L2132-L2204, tests/test_power.py:L1747-L1758.
  • Severity: P1. Impact: _check_ddd_dgp_compat() suppresses all ignored-parameter warnings whenever n_per_cell is present in data_generator_kwargs, even though overriding n_per_cell does not make n_periods, treatment_period, or treatment_fraction meaningful to generate_ddd_data(). That leaves a silent design-mismatch path for DDD simulations and conflicts with the registry note promising a warning whenever inputs differ from the effective design. Concrete fix: let n_per_cell suppress only the effective-N rounding/clamping warning; keep warning on ignored n_periods, treatment_period, and treatment_fraction unless the caller supplies a fully custom data_generator. See diff_diff/power.py:L313-L359, docs/methodology/REGISTRY.md:L1721, tests/test_power.py:L1023-L1035.
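The downward-search fix suggested for simulate_sample_size() can be sketched generically. Here `power_at` stands in for the simulated power objective (assumed roughly monotone in n), `min_n` is the registry floor used only as an initial guess, and `hard_floor` is the estimator's true feasibility limit; all names are illustrative, not the diff_diff API.

```python
def search_required_n(power_at, target_power, min_n, hard_floor=4):
    """Smallest n with power_at(n) >= target_power, assuming monotone power.

    Sketch of the review's suggested fix: when the registry floor already
    reaches target power, bisect *below* it down to hard_floor instead of
    returning min_n as the required sample size.
    """
    if power_at(min_n) >= target_power:
        # Floor already suffices: rule out smaller n before returning it.
        lo, hi = hard_floor, min_n
    else:
        # Bracket upward by doubling until target power is reached.
        n = min_n
        while power_at(n) < target_power:
            n *= 2
        lo, hi = n // 2, n
    # Bisect for the smallest feasible n in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if power_at(mid) >= target_power:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Because both branches feed the same bisection, the answer no longer depends on whether the initial guess landed above or below the true minimum.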

Code Quality

No material findings beyond the correctness issues above.

Performance

No material findings.

Maintainability

No material findings beyond the need to separate “initial search guess” from “true minimum supported N” in the sample-size search path.

Tech Debt

  • Severity: P3. Impact: The earlier TripleDifference power-analysis limitation is now properly tracked in TODO.md and documented in REGISTRY.md, so that specific deviation is mitigated under the project’s review policy. Concrete fix: none for this PR. See TODO.md:L42-L50, docs/methodology/REGISTRY.md:L1720-L1721.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests lock in the two behaviors above instead of the documented contract: test_lo_already_sufficient_auto asserts the auto path should return the registry floor as required_n, and test_ddd_no_warn_n_per_cell_override asserts no warning in a still-ignored-parameter DDD path. That will hide the fixes behind failing tests. Concrete fix: update those assertions to match the corrected behavior, and add a DDD regression where n_per_cell is overridden but mismatched n_periods/treatment_fraction still warn. See tests/test_power.py:L1023-L1035, tests/test_power.py:L1747-L1758.

Path to Approval

  1. Fix simulate_sample_size() so the auto path does not return min_n as required_n without ruling out smaller sample sizes; either search downward after the first successful bracket or explicitly convert the API/result semantics to “upper bound” with a warning.
  2. Fix _check_ddd_dgp_compat() so n_per_cell only suppresses the effective-sample-size warning, not the warnings for ignored n_periods, treatment_period, and treatment_fraction.
  3. Update the new tests to reflect the corrected contracts, especially tests/test_power.py:L1023-L1035 and tests/test_power.py:L1747-L1758.

… warning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: fa93beae35118378807aacbcecad8c3522477b21


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only. I could not run pytest or import the package in this environment because pytest and numpy are unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: On the registered-estimator path, dgp_kwargs.update(data_gen_kwargs) silently lets data_generator_kwargs override builder-supplied core simulation inputs. A conflicting key such as treatment_effect will make simulate_power() report one true_effect while generating data with another, which in turn corrupts bias, coverage, and any simulate_mde() / simulate_sample_size() search that relies on simulate_power() as its objective. Before this refactor, duplicate kwargs on the explicit call path would have raised immediately; now they silently change the estimand/DGP. Concrete fix: reject conflicting overrides on the registry path for core/search-driving inputs instead of blindly update()-ing, with a narrow allowlist only for documented, intentional overrides such as DDD n_per_cell. Add regression tests for collisions on treatment_effect and the searched quantity. Location: diff_diff/power.py:L1554-L1564, diff_diff/power.py:L1645-L1664, diff_diff/power.py:L1933-L1952, diff_diff/power.py:L2137-L2156
  • Severity: P1. Impact: simulate_sample_size() still claims to find the “smallest N” / “required sample size,” but in auto mode it returns the registry floor as soon as that floor reaches target power, without evaluating any smaller n. The added warning helps, but the returned required_n semantics remain incorrect and undocumented in the registry. Concrete fix: either actually search below the registry floor until infeasibility is established, or rename/document the auto-path result everywhere as an upper bound/search floor rather than required_n. Location: diff_diff/power.py:L1774-L1776, diff_diff/power.py:L1818-L1828, diff_diff/power.py:L2072-L2075, diff_diff/power.py:L2191-L2209
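The collision-rejection fix from the first bullet can be sketched as follows. The helper name, the allowlist contents, and the error wording are assumptions for illustration; only the idea (reject conflicting core keys rather than `dict.update()` them) comes from the review.

```python
def merge_dgp_kwargs(builder_kwargs, data_generator_kwargs,
                     allow_override=("n_per_cell",)):
    """Merge caller overrides into builder-supplied DGP kwargs, rejecting
    collisions on core simulation inputs.

    Sketch: only documented keys (e.g. DDD's n_per_cell) may override;
    a conflicting key like treatment_effect raises instead of silently
    changing the simulated estimand.
    """
    overrides = data_generator_kwargs or {}
    conflicts = set(builder_kwargs) & (set(overrides) - set(allow_override))
    if conflicts:
        raise ValueError(
            f"data_generator_kwargs would silently override core simulation "
            f"inputs {sorted(conflicts)}; set them via the simulate_* "
            f"arguments instead."
        )
    return {**builder_kwargs, **overrides}
```

This restores the pre-refactor behavior where duplicate kwargs failed immediately, while keeping the intentional DDD `n_per_cell` override path open.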

Code Quality

No material findings beyond the correctness issues above.

Performance

No material findings.

Maintainability

No material findings beyond the unresolved required_n contract mismatch in simulate_sample_size().

Tech Debt

  • Severity: P3. Impact: The TripleDifference power-analysis limitation is now properly mitigated under project policy: the fixed 2×2×2 DGP is documented in the Methodology Registry and tracked in TODO.md, so that deviation is informational rather than blocking. Concrete fix: none for this PR. Location: docs/methodology/REGISTRY.md:L1720-L1721, TODO.md:L46-L50

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests now lock in the auto-floor behavior by asserting that simulate_sample_size() should return 20 in the “registry floor already sufficient” case, and there is no regression coverage for the new duplicate-core-kwarg collision path on registered estimators. That means the remaining correctness issues will look intentional to the suite. Concrete fix: update the auto-floor expectation once simulate_sample_size() semantics are corrected, and add explicit tests that conflicting registry-path overrides such as data_generator_kwargs={"treatment_effect": ...} raise instead of silently changing the simulated DGP. Location: tests/test_power.py:L1762-L1774, diff_diff/power.py:L1554-L1564

Path to Approval

  1. Prevent silent core-DGP overrides on the registry path. Reject conflicting data_generator_kwargs that would replace builder-supplied simulation knobs or search variables, and add regression tests covering simulate_power(), simulate_mde(), and simulate_sample_size().
  2. Make simulate_sample_size() consistent with its contract: either search below the registry floor until the true minimum is identified, or explicitly change the API/result/docs to return a documented upper bound/search floor instead of required_n.
  3. Update the tests to match the corrected sample-size semantics and to cover the duplicate-kwarg regression path.
